-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachprod: fix fileExistsOnFirstNode
check
#105514
Conversation
For some reason, the current form of `fileExistsOnFirstNode` can return `found=true` when it should return `found=false`. This can be reproduced by running the `multitenant-upgrade` roachtest and seeing it hang at: ``` multitenant_upgrade.go:154: test status: checking the pre-upgrade sql server still works after the system tenant binary upgrade ``` because of a `TLS handshake error`. Note: the test is also broken because of another issue so with this fix it should now fail with: ``` (assertions.go:333).Fail: Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:390 github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:189 github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:38 main/pkg/cmd/roachtest/test_runner.go:1060 GOROOT/src/runtime/asm_arm64.s:1172 Error: Not equal: expected: [][]string{[]string{"23.1"}} actual : [][]string{[]string{"22.2"}} Diff: --- Expected +++ Actual @@ -2,3 +2,3 @@ ([]string) (len=1) { - (string) (len=4) "23.1" + (string) (len=4) "22.2" } Test: multitenant-upgrade ``` Release note: None Epic: none
8629e1a
to
305fcae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, the current form of
fileExistsOnFirstNode
can returnfound=true
when it should returnfound=false
.
The reason for that is $?
which was substituted by bash
prior to executing the commands passed via -c
. It needed to be escaped; e.g.,
bash -c "test -e '/foobar'; echo $?"
0
bash -c "test -e '/foobar'; echo \$?"
1
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This possibly caused the failure on #105176 as well.
I'm in favour of resolving this as soon as possible to avoid any other unnecessary failures.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I exposed the fileExistsOnFirstNode
function on the roachprod cli
to put this through a few scenarios. The original $(test -e ...); echo $?
command seems to work, even without the escape (there's an ssh escape down the line that might be the reason for this).
The flaw in the previous revision is actually much more sinister. There's a \n
in Stdout
that makes the rudimentary result.Stdout == "0"
insufficient. It will always state the file as not being present.
The fix to the previous revision is then obvious: strings.TrimSpace(result.Stdout) == "0"
But in hindsight, perhaps it's not best to rely on the output, unless the matching is improved and not error prone.
Checking for other unexpected error codes is also a nice addition.
Thanks for the fix! I'm for reverting back to a command failure that does not log to make this more robust, unless anyone has any objections.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: If the check always states 'not found' as true, then it does make me wonder about the details of the test failure if the certs are always distributed and how that causes a problem?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix for this was also included in the first commit of #105454 (using strings.TrimSpace
). I kind of prefer the simplicity of that as the echo $?
command would only equal "0" if the command succeeded and the file exists. Roachprod errors (err != nil
) are returned to the caller as expected.
That said, I'm fine with this change as well and with merging it sooner. Thanks for fixing!
if (result.RemoteExitStatus != 0 && result.RemoteExitStatus != 1) || err != nil { | ||
// Unexpected exit status (neither 0 nor 1) or non-nil error. Return combined output along with err returned | ||
// from the call if it's not nil. | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic could be simplified by lifting the shared error (running '%s' ...
) and calling Wrapf
if err != nil
testCmd := `$(test -e ` + path + `);` | ||
// Do not log output to stdout/stderr because in some cases this call will be expected to exit 1. | ||
result, err := c.runCmdOnSingleNode(ctx, l, c.Nodes[0], testCmd, true, nil, nil) | ||
if (result.RemoteExitStatus != 0 && result.RemoteExitStatus != 1) || err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does result.Err
play a role here? Shouldn't we check for it too? (cc @smg260)
I think it's the reverse -- the function would return 'found' as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @renatolabs)
pkg/roachprod/install/cluster_synced.go
line 1537 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
How does
result.Err
play a role here? Shouldn't we check for it too? (cc @smg260)
Yes. When running commands over ssh, via runCmdOnSIngleNode
, res.Err
is more relevant, since we don't expect the standalone err
to non-nil very often. Since we're checking for 1 or 0 to covers an acceptable res.Err
(file not existing), we should prob do something like (roughly):
res,err := runOnSingleNode(..)
if err != nil return false, err // always return the non command error immediately
// special cases for `test -e`
if res.exitStatus == 0 return true, nil
if res.exitStatus == 1 return false, nil
else
// any other exit code we map to an error for the caller
return false, res.Err
TFTRs! bors r+ |
Build succeeded: |
For some reason, the current form of
fileExistsOnFirstNode
can returnfound=true
when it should returnfound=false
.This can be reproduced by running the
multitenant-upgrade
roachtest and seeing it hang at:because of a
TLS handshake error
.Note: the test is also broken because of another issue so with this fix it should now fail with:
Release note: None
Epic: none